source-mysql: Decode text correctly in non-UTF8 character sets #1979

Merged
11 commits merged into main on Sep 30, 2024

Conversation

@willdonnelly (Member) commented on Sep 24, 2024

Description:

This PR fixes source-mysql to correctly decode text columns with non-UTF8 character sets. Fixes #1951.

Previously, backfilled values were correct because MySQL always sends those to us in UTF-8, but replicated values were occasionally incorrect (or at least sub-optimal): the bytes stored in the binlog are in whatever character set the column uses, and at a certain point we just cast `[]byte -> string`, which results in any non-ASCII bytes getting replaced with U+FFFD REPLACEMENT CHARACTER.

The fix is:

  1. Keep track of the character set of each column. This is the hard part; there are a lot of subtleties to making it work correctly even in oddball cases, such as a DDL ALTER TABLE query occurring after the table backfill which adds a new column whose character set could be explicit, could be implied by an explicit collation, or could be the table default.
  2. Modify the decoding logic to apply an appropriate `[]byte -> string` decode function to text column bytes received via replication, based on the character set of the column. This part is reasonably straightforward, but it required adding an extra parameter to the value translation code to distinguish between backfilled values (which are always UTF-8) and replicated ones (where we need to do all this fanciness).
  3. Implement a handful of decoding functions (a rough sketch follows this list). Currently we've got UTF-8, latin-1, and UCS-2 decoders. There are a couple more which might theoretically come up in the future, but I didn't think they warranted worrying about right now. Honestly, even UCS-2 is unlikely to ever be used in the real world; I added it because, as a double-byte character set, it was a good way of exercising some of the edge cases and clearly demonstrating working vs. non-working behavior.
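
To make item 3 above concrete, here is a minimal sketch of what such decode functions could look like. All identifiers (`textDecoders`, `decodeLatin1`, `decodeUCS2`) are illustrative and are not the connector's actual names:

```go
// Hypothetical sketch of per-charset decode functions. Names are
// illustrative; they are not the identifiers used in source-mysql.
package charsets

import (
	"fmt"
	"unicode/utf16"
)

// decodeLatin1 maps each latin-1 byte to the Unicode code point with the
// same numeric value, which is exactly the latin-1 -> UTF-8 conversion.
func decodeLatin1(bs []byte) (string, error) {
	var runes = make([]rune, len(bs))
	for i, b := range bs {
		runes[i] = rune(b)
	}
	return string(runes), nil
}

// decodeUCS2 interprets the bytes as big-endian 16-bit code units. UCS-2
// has no surrogate pairs, so each unit is a single code point.
func decodeUCS2(bs []byte) (string, error) {
	if len(bs)%2 != 0 {
		return "", fmt.Errorf("invalid UCS-2 length %d", len(bs))
	}
	var units = make([]uint16, 0, len(bs)/2)
	for i := 0; i < len(bs); i += 2 {
		units = append(units, uint16(bs[i])<<8|uint16(bs[i+1]))
	}
	return string(utf16.Decode(units)), nil
}

// decodeUTF8 just casts the bytes, since they are already UTF-8.
func decodeUTF8(bs []byte) (string, error) { return string(bs), nil }

// textDecoders maps a MySQL charset name to its decode function.
var textDecoders = map[string]func([]byte) (string, error){
	"utf8":    decodeUTF8,
	"utf8mb4": decodeUTF8,
	"latin1":  decodeLatin1,
	"ucs2":    decodeUCS2,
}
```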

I have opted to make the connector default to a UTF-8 decoder when faced with an unknown or unrecognized charset rather than failing the capture. This would probably be the wrong choice if the connector were new, but right now we don't know if there are any captures in production using some other character set which I overlooked, so the safest behavior is to keep doing the same thing as before this PR (decoding the bytes by casting to a UTF-8 string directly) while logging an error. Later on I can go check to see if that error actually occurred in production anywhere, add support for the offending character sets, and consider making "unknown charset" a fatal error at that time.
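
As a sketch of that fallback behavior (building on the hypothetical `textDecoders` map above, and using the standard-library `log` package purely for illustration):

```go
// Hypothetical sketch of the "unknown charset" fallback described above.
func decodeText(charset string, bs []byte) (string, error) {
	if charset == "" {
		charset = "utf8mb4" // pre-existing metadata carries no charset info
	}
	if decode, ok := textDecoders[charset]; ok {
		return decode(bs)
	}
	// Unknown charset: log an error but keep the pre-PR behavior of
	// casting the bytes directly to a (nominally UTF-8) string.
	log.Printf("error: unknown column charset %q, decoding as UTF-8", charset)
	return string(bs), nil
}
```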

Workflow steps:

Most captures don't need to care or do anything. Captures will continue operating as normal for preexisting bindings (because their metadata for all tables and columns has no charset information, which means an implicit default of UTF-8). New captures, or new bindings or backfills on existing captures, will reinitialize that metadata and pick up the appropriate charset information. In most cases this will have no impact because the charset really is UTF-8.

For captures of latin-1 or other non-Unicode text columns which actually contain non-ASCII code points (currently captured as U+FFFD REPLACEMENT CHARACTER), re-backfilling the binding will cause the correct values to be captured.

This is a new test case which exercises `text` columns with various
different collations / character sets, storing some test strings with
interesting characters from various languages.

I believe the current set of collations and test strings is enough
to demonstrate the known issues with our current handling of these
edge cases:

  - A `latin1` collation stores strings in the latin-1 character
    set, and then we cast those raw bytes to a string, which causes
    all of the non-ASCII characters to be replaced with U+FFFD.
  - A `ucs2` collation stores strings in the UCS-2 double-byte
    character set, and so for similar reasons all replicated values
    are terribly mangled.
  - A `binary` collation is captured as base64 bytes, because if you
    tell MySQL `TEXT COLLATE binary` it apparently creates a `BLOB`
    column instead. This is not an error, but it seemed worth noting
    and including in the test here. The base64'd text appears to be a
    faithful base64 representation of the input as a UTF-8 string.

Modifies the column-discovery and primary-key-discovery queries
to apply the same "not in a system schema" filtering that the
table-discovery query has, and then modifies column discovery
so that the raw information is logged for each column.

This commit adds most of the necessary plumbing to keep track of
the character set of a text column and apply a charset-aware
decode function instead of just casting the bytes to a string.

However it does not actually implement the proper decoding and
instead just still does `var str = string(bs)` at the appropriate
spot with a TODO noting that's wrong. The next commit will come
in and actually implement proper decoding.
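
A rough sketch of the kind of plumbing this commit describes (the function and parameter names here are made up for illustration, not taken from the connector):

```go
// Hypothetical sketch: the value-translation path now carries enough
// information to tell backfilled values apart from replicated ones.
func translateTextValue(bs []byte, charset string, fromBackfill bool) (string, error) {
	if fromBackfill {
		// Backfilled values are always delivered by MySQL in UTF-8.
		return string(bs), nil
	}
	// Replicated values arrive in the column's own charset. At this
	// point in the PR the decode is still a plain cast, with a TODO
	// to make it charset-aware in the next commit.
	var str = string(bs) // TODO: decode according to `charset`
	return str, nil
}
```
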
Currently this doesn't work and I've stubbed out the logic with
a hard-coded default of `utf8mb4` but now there's a test case
which will show the corrected values when I fix that.

This works just fine, which is to be expected because we haven't
changed anything and we already knew that latin-1 character set
text columns work fine under backfills.

This more or less preserves the old behavior for any charsets we
haven't thought to add to the decoders map, but logs an error so
I can check for it in a few days and add any others we might be
missing.

And use that collation when processing DDL alterations which
don't explicitly specify another collation or charset.
In this case we want to apply the same "charset from collation"
mapping function that we use during discovery. Now the hierarchy
of column charsets goes:

1. Explicit CHARSET declaration
2. Explicit COLLATE declaration
3. Default for the table (which omits utf8mb4 in some cases)
4. Default to utf8mb4 as the last resort
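
A hypothetical sketch of that resolution order (function names are illustrative; `strings` is the standard-library package):

```go
// Hypothetical sketch of the column charset resolution hierarchy.
func resolveColumnCharset(explicitCharset, explicitCollation, tableDefault string) string {
	if explicitCharset != "" {
		return explicitCharset // 1. explicit CHARSET declaration
	}
	if cs := charsetFromCollation(explicitCollation); cs != "" {
		return cs // 2. charset implied by an explicit COLLATE declaration
	}
	if tableDefault != "" {
		return tableDefault // 3. the table's default charset
	}
	return "utf8mb4" // 4. last-resort default
}

// charsetFromCollation derives a charset name from a collation name. MySQL
// collation names are prefixed with their charset (e.g. "latin1_swedish_ci").
func charsetFromCollation(collation string) string {
	if i := strings.Index(collation, "_"); i > 0 {
		return collation[:i]
	}
	return collation
}
```
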
@willdonnelly added the change:planned (This is a planned change) label on Sep 24, 2024
@willdonnelly requested a review from a team on September 24, 2024 at 19:58
@jgraettinger (Member) left a comment

LGTM

It was always the plan to include the table charset in the table
metadata unconditionally; I just left that part for a follow-up
commit to keep the diffs separate.

This removes the `"utf8mb4" -> ""` omission from the replication
code so that it's always made explicit what charset a table is
using in metadata initialized after this change. The default of
`"" -> "utf8mb4"` still exists in the DDL alteration datatype
translation so that will always be explicit for newly added
columns, and likewise the string decoding function defaults
`"" -> "utf8mb4"` so that old metadata works, but after this
change we always specify charset information explicitly when
generating new metadata.
@willdonnelly merged commit cf8ef4d into main on Sep 30, 2024
50 of 53 checks passed
@willdonnelly deleted the wgd/debugging-20240916 branch on September 30, 2024 at 17:45

Successfully merging this pull request may close these issues.

source-mysql: Fix replication of text columns with non-Unicode character sets